-
Notifications
You must be signed in to change notification settings - Fork 495
TOOLS/DEVICE: support channel id in perftest #10993
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
WalkthroughAdded a device endpoint channel count parameter and propagated it through perf test initialization into CUDA context and kernel code. CUDA runtime now computes a per-thread channel_id (threadIdx.x % num_channels) and uses it in device_put_* calls; device-side arrays are copied via a new device_vector helper. Changes
Sequence DiagramsequenceDiagram
participant Config as Config (libperf)
participant Test as Test Init
participant Host as Host runtime
participant GPU as CUDA Kernel
participant NIC as Device ops
Config->>Test: device_ep_channel_count
Test->>Host: populate params (num_channels)
Host->>Host: device_vector(copy indices/offsets/lengths)
Host->>GPU: launch kernel (params, device arrays)
GPU->>GPU: channel_id = threadIdx.x % num_channels
GPU->>NIC: device_put_* (with channel_id)
NIC-->>GPU: completion
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
🧰 Additional context used🧠 Learnings (1)📚 Learning: 2025-11-06T09:04:19.215ZApplied to files:
🧬 Code graph analysis (1)src/tools/perf/cuda/ucp_cuda_kernel.cu (1)
🔇 Additional comments (3)
Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
contrib/test_jenkins.sh(1 hunks)contrib/ucx_perftest_config/test_types_ucp_device_cuda(2 hunks)src/tools/perf/api/libperf.h(3 hunks)src/tools/perf/cuda/cuda_kernel.cuh(4 hunks)src/tools/perf/cuda/ucp_cuda_kernel.cu(7 hunks)src/tools/perf/perftest.c(2 hunks)src/tools/perf/perftest.h(1 hunks)src/tools/perf/perftest_params.c(3 hunks)test/gtest/common/test_perf.cc(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
src/tools/perf/perftest.c (2)
src/tools/perf/lib/ucp_tests.cc (4)
params(197-219)params(197-199)params(221-231)params(221-221)src/ucs/sys/sys.c (1)
ucs_get_page_size(530-543)
src/tools/perf/perftest_params.c (1)
src/tools/perf/lib/uct_tests.cc (2)
params(118-130)params(118-118)
src/tools/perf/cuda/ucp_cuda_kernel.cu (1)
src/ucp/api/device/ucp_device_impl.h (5)
ucs_status_t(83-104)ucp_device_progress_req(441-451)ucp_device_put_single(143-166)ucp_device_put_multi(264-290)ucp_device_put_multi_partial(347-377)
| if ((threadIdx.x == 0) && ucs_unlikely(completed >= m_next_report_iter)) { | ||
| ucx_perf_cuda_time_t cur_time = ucx_perf_cuda_get_time_ns(); | ||
| ucx_perf_cuda_time_t iter_time = (cur_time - m_last_report_time) / | ||
| (completed - m_ctx.completed_iters); | ||
| m_last_report_time = cur_time; | ||
| m_ctx.completed_iters = completed; | ||
| __threadfence_system(); | ||
|
|
||
| m_next_report_iter = ucs_min(completed + (m_report_interval_ns / iter_time), | ||
| m_max_iters); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guard against zero deltas before dividing.
Here we divide by both (completed - m_ctx.completed_iters) and iter_time. If the progress callback fires twice with the same completed (e.g. because m_report_interval_ns / iter_time rounded to zero) or if the GPU clock hasn’t advanced yet, both denominators become zero and the SM traps with a divide-by-zero. Please bail out when delta == 0 and clamp iter_time to at least 1 before the final division.
@@
- if ((threadIdx.x == 0) && ucs_unlikely(completed >= m_next_report_iter)) {
- ucx_perf_cuda_time_t cur_time = ucx_perf_cuda_get_time_ns();
- ucx_perf_cuda_time_t iter_time = (cur_time - m_last_report_time) /
- (completed - m_ctx.completed_iters);
+ if ((threadIdx.x == 0) && ucs_unlikely(completed >= m_next_report_iter)) {
+ ucx_perf_counter_t delta_iters = completed - m_ctx.completed_iters;
+ if (delta_iters == 0) {
+ return;
+ }
+
+ ucx_perf_cuda_time_t cur_time = ucx_perf_cuda_get_time_ns();
+ ucx_perf_cuda_time_t delta_time = cur_time - m_last_report_time;
+ if (delta_time == 0) {
+ delta_time = 1;
+ }
+
+ ucx_perf_cuda_time_t iter_time = ucs_max<ucx_perf_cuda_time_t>(delta_time / delta_iters, 1);
m_last_report_time = cur_time;
m_ctx.completed_iters = completed;
__threadfence_system();
- m_next_report_iter = ucs_min(completed + (m_report_interval_ns / iter_time),
- m_max_iters);
+ ucx_perf_cuda_time_t next_interval = m_report_interval_ns / iter_time;
+ if (next_interval == 0) {
+ next_interval = 1;
+ }
+ m_next_report_iter = ucs_min(completed + next_interval, m_max_iters);
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if ((threadIdx.x == 0) && ucs_unlikely(completed >= m_next_report_iter)) { | |
| ucx_perf_cuda_time_t cur_time = ucx_perf_cuda_get_time_ns(); | |
| ucx_perf_cuda_time_t iter_time = (cur_time - m_last_report_time) / | |
| (completed - m_ctx.completed_iters); | |
| m_last_report_time = cur_time; | |
| m_ctx.completed_iters = completed; | |
| __threadfence_system(); | |
| m_next_report_iter = ucs_min(completed + (m_report_interval_ns / iter_time), | |
| m_max_iters); | |
| if ((threadIdx.x == 0) && ucs_unlikely(completed >= m_next_report_iter)) { | |
| ucx_perf_counter_t delta_iters = completed - m_ctx.completed_iters; | |
| if (delta_iters == 0) { | |
| return; | |
| } | |
| ucx_perf_cuda_time_t cur_time = ucx_perf_cuda_get_time_ns(); | |
| ucx_perf_cuda_time_t delta_time = cur_time - m_last_report_time; | |
| if (delta_time == 0) { | |
| delta_time = 1; | |
| } | |
| ucx_perf_cuda_time_t iter_time = ucs_max<ucx_perf_cuda_time_t>(delta_time / delta_iters, 1); | |
| m_last_report_time = cur_time; | |
| m_ctx.completed_iters = completed; | |
| __threadfence_system(); | |
| ucx_perf_cuda_time_t next_interval = m_report_interval_ns / iter_time; | |
| if (next_interval == 0) { | |
| next_interval = 1; | |
| } | |
| m_next_report_iter = ucs_min(completed + next_interval, m_max_iters); | |
| } |
🤖 Prompt for AI Agents
In src/tools/perf/cuda/cuda_kernel.cuh around lines 52 to 61, the code can
divide by zero when computing per-iteration times because both delta =
(completed - m_ctx.completed_iters) and iter_time can be zero; compute and store
delta first, if delta == 0 bail out/skip the reporting block to avoid dividing
by zero, then compute iter_time and clamp it to at least 1 (e.g., use a max with
1) before using it in any division or in calculating m_next_report_iter so no
division by zero occurs.
| if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) { | ||
| req = nullptr; | ||
| flags = static_cast<ucp_device_flags_t>(0); | ||
| } else { | ||
| req = &m_requests[index]; | ||
| m_pending_map |= UCS_BIT(index); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not hand out a null request under flow control
When get_request() returns req == nullptr, the very next call to ucp_perf_cuda_send_async() passes that pointer straight into ucp_device_put_single/multi/multi_partial, all of which dereference req unconditionally. With device_fc_window > 1, we therefore crash every time this branch fires. Please always return a valid ucp_device_request_t* here and keep the existing bitmap semantics for the FC “prefetch” case.
- if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) {
- req = nullptr;
- flags = static_cast<ucp_device_flags_t>(0);
- } else {
- req = &m_requests[index];
- m_pending_map |= UCS_BIT(index);
- }
+ req = &m_requests[index];
+ if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) {
+ flags = static_cast<ucp_device_flags_t>(0);
+ } else {
+ m_pending_map |= UCS_BIT(index);
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) { | |
| req = nullptr; | |
| flags = static_cast<ucp_device_flags_t>(0); | |
| } else { | |
| req = &m_requests[index]; | |
| m_pending_map |= UCS_BIT(index); | |
| } | |
| req = &m_requests[index]; | |
| if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) { | |
| flags = static_cast<ucp_device_flags_t>(0); | |
| } else { | |
| m_pending_map |= UCS_BIT(index); | |
| } |
| case 'F': | ||
| return parse_int(opt_arg, ¶ms->super.device_fc_window, | ||
| "device flow control window size", 1, INT_MAX); | ||
| case 'y': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix the -F parser’s target type.
parse_int() expects an int*, but we are handing it an unsigned*. With -Werror (the default in CI) this is an incompatible pointer type error, so the whole build breaks. Please parse into a local int and assign it back once validation succeeds.
@@
- case 'F':
- return parse_int(opt_arg, ¶ms->super.device_fc_window,
- "device flow control window size", 1, INT_MAX);
+ case 'F': {
+ int fc_window;
+
+ status = parse_int(opt_arg, &fc_window,
+ "device flow control window size", 1, INT_MAX);
+ if (status == UCS_OK) {
+ params->super.device_fc_window = fc_window;
+ }
+ return status;
+ }🤖 Prompt for AI Agents
In src/tools/perf/perftest_params.c around lines 728 to 731, the -F option
passes an unsigned int pointer to parse_int which expects an int*, causing an
incompatible pointer type error under -Werror; fix it by parsing into a local
int variable (call parse_int with &local_int and the same bounds), check
parse_int succeeded, then assign local_int to params->super.device_fc_window
(cast/convert to the unsigned field) after validation; ensure you preserve the
same bounds and error handling behavior.
7355ab4 to
84da1e1
Compare
|
|
||
| void init_params(const ucx_perf_context_t &perf) | ||
| { | ||
| m_params.num_channels = perf.params.device_ep_channel_count; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just set it in ucp_perf_cuda_params_handler and remove this func.
What?
Support channel id in ucx perftest.
Why?
Improve performance by distribute request on qps.
How?
Select channel by thread id modulo numbers of channel.
Summary by CodeRabbit
New Features
Chores